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Abstract 


With the goal to generate more scalable algorithms with higher efficiency and fewer open parame- 
ters, reinforcement learning (RL) has recently moved towards combining classical techniques from 
optimal control and dynamic programming with modern learning techniques from statistical esti- 
mation theory. In this vein, this paper suggests to use the framework of stochastic optimal control 
with path integrals to derive a novel approach to RL with parameterized policies. While solidly 
grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi- 
Bellman (HJB) equations, policy improvements can be transformed into an approximation problem 
of a path integral which has no open algorithmic parameters other than the exploration noise. The 
resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, 
depending on how the learning problem is structured. The update equations have no danger of 
numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our 
new algorithm demonstrates interesting similarities with previous RL research in the framework 
of probability matching and provides intuition why the slightly heuristically motivated probability 
matching approach can actually perform well. Empirical evaluations demonstrate significant per- 
formance improvements over gradient-based policy learning and scalability to high-dimensional 
control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog 
illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that 
Policy Improvement with Path Integrals (PI?) offers currently one of the most efficient, numeri- 
cally robust, and easy to implement algorithms for RL based on trajectory roll-outs. 


Keywords: stochastic optimal control, reinforcement learning, parameterized policies 


1. Introduction 


While reinforcement learning (RL) is among the most general frameworks of learning control to cre- 
ate truly autonomous learning systems, its scalability to high-dimensional continuous state-action 
systems, for example, humanoid robots, remains problematic. Classical value-function based meth- 
ods with function approximation offer one possible approach, but function approximation under the 
non-stationary iterative learning process of the value-function remains difficult when one exceeds 
about 5-10 dimensions. Alternatively, direct policy learning from trajectory roll-outs has recently 
made significant progress (Peters, 2007), but can still become numerically brittle and full of open 
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tuning parameters in complex learning problems. In new developments, RL researchers have started 
to combine the well-developed methods from statistical learning and empirical inference with clas- 
sical RL approaches in order to minimize tuning parameters and numerical problems, such that ulti- 
mately more efficient algorithms can be developed that scale to significantly more complex learning 
system (Dayan and Hinton, 1997; Koeber and Peters, 2008; Peters and Schaal, 2008c; Toussaint 
and Storkey, 2006; Ghavamzadeh and Yaakov, 2007; Deisenroth et al., 2009; Vlassis et al., 2009; 
Jetchev and Toussaint, 2009). 

In the spirit of these latter ideas, this paper addresses a new method of probabilistic reinforce- 
ment learning derived from the framework of stochastic optimal control and path integrals, based on 
the original work of Kappen (2007) and Broek et al. (2008). As will be detailed in the sections be- 
low, this approach makes an appealing theoretical connection between value function approximation 
using the stochastic HJB equations and direct policy learning by approximating a path integral, that 
is, by solving a statistical inference problem from sample roll-outs. The resulting algorithm, called 
Policy Improvement with Path Integrals (PI’), takes on a surprisingly simple form, has no open 
algorithmic tuning parameters besides the exploration noise, and it has numerically robust perfor- 
mance in high dimensional learning problems. It also makes an interesting connection to previous 
work on RL based on probability matching (Dayan and Hinton, 1997; Peters and Schaal, 2008c; 
Koeber and Peters, 2008) and motivates why probability matching algorithms can be successful. 

This paper is structured into several major sections: 


e Section 2 addresses the theoretical development of stochastic optimal control with path in- 
tegrals. This is a fairly theoretical section. For a quick reading, we would recommend Sec- 
tion 2.1 for our basic notation, and Table 1 for the final results. Exposing the reader to a 
sketch of the details of the derivations opens the possibility to derive path integral optimal 
control solutions for other dynamical systems than the one we address in Section 2.1. 


The main steps of the theoretical development include: 
— Problem formulation of stochastic optimal control with the stochastic Hamilton-Jacobi- 
Bellman (HJB) equation 
— The transformation of the HJB into a linear PDE 


The generalized path integral formulation for control systems with controlled and un- 
controlled differential equations 


General derivation of optimal controls for the path integral formalism 
— Path integral optimal control applied to special cases of control systems 
e Section 3 relates path integral optimal control to reinforcement learning. Several main issues 
are addressed: 
— Reinforcement learning with parameterized policies 


— Dynamic Movement Primitives (DMP) as a special case of parameterized policies, 
which matches the problem formulation of path integral optimal control. 


— Derivation of Policy Improvement with Path Integrals (PI), which is an application of 
path integral optimal control to DMPs. 


e Section 4 discusses related work. 
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e Section 5 illustrates several applications of PI’ to control problems in robotics. 


e Section 6 addresses several important issues and characteristics of RL with PI’. 


2. Stochastic Optimal Control with Path Integrals 


The goal in stochastic optimal control framework is to control a stochastic dynamical system while 
minimizing a performance criterion. Therefore, stochastic optimal control can be thought as a con- 
strained optimization problem in which the constrains corresponds to stochastic dynamical systems. 
The analysis and derivations of stochastic optimal control and path integrals in the next sections rely 
on the Bellman Principle of optimality (Bellman and Kalaba, 1964) and the HJB equation. 


2.1 Stochastic Optimal Control Definition and Notation 


For our technical developments, we will use largely a control theoretic notation from trajectory- 
based optimal control, however, with an attempt to have as much overlap as possible with the 
standard RL notation (Sutton and Barto, 1998). Let us define a finite horizon cost function for a 
trajectory T; (which can also be a piece of a trajectory) starting at time t; in state x, and ending at 
time! ty 


RO) =O + fn dt, (1) 


with Qy = (x) denoting a terminal reward at time ty and r, denoting the immediate cost at time 
t. In stochastic optimal control (Stengel, 1994), the goal is to find the controls u; that minimize the 
value function: 
V(x;) = V; = min Erz, [R(t:)] , (2) 
Uy;:ty 
where the expectation Er,[.] is taken over all trajectories starting at x;. We consider the rather 
general class of control systems: 


x, = f(x,,t) + G(x;) (u, +E) =f, +G, (w; +E), (3) 


with x, € R”*1 denoting the state of the system, G; = G(x,) € R”*P the control matrix, f; = f(x;) € 
KR”! the passive dynamics, u, € R?*! the control vector and £; € R?*! Gaussian noise with vari- 
ance Le. As immediate cost we consider 


1 
ri = r(Xr, Wt) = qr + 50; Ruy, (4) 


where q; = q(x;,f) is an arbitrary state-dependent cost function, and R is the positive semi-definite 
weight matrix of the quadratic control cost. The stochastic HJB equation (Stengel, 1994; Fleming 
and Soner, 2006) associated with this stochastic optimal control problem is expressed as follows: 


1 
—0,V, = min (r + (VV) TF, + ztrace ((VaxVi)GLeG/ )) (5) 





1. If we need to emphasize a particular time, we denote it by ¢;, which also simplifies a transition to discrete time 
notation later. We use t without subscript when no emphasis is needed when this “time slice” occurs, to for the start 
of a trajectory, and ty for the end of a trajectory. 
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where F; is defined as F, = f(x;,t) + G(x;)u,. To find the minimum, the cost function (4) is 
inserted into (5) and the gradient of the expression inside the parenthesis is taken with respect to 
controls u and set to zero. The corresponding optimal control is given by the equation: 


u(x;) =u, = -R` !GT (V,,V;). 


Substitution of the optimal control above, into the stochastic HJB (5), results in the following 
nonlinear and second order Partial Differential Equation (PDE): 


1 E 1 
—V; = qi + (Vx V) f; — 5 (VV) GR LGT (VxV;) + ztrace ((VixV;)G:ZeG/ ) . 


The Vx and Vxx symbols refer to the Jacobian and Hessian, respectively, of the value function 
with respect to the state x, while ð; is the partial derivative with respect to time. For notational 
compactness, we will mostly use subscripted symbols to denote time and state dependencies, as 
introduced in the equations above. 


2.2 Transformation of HJB into a Linear PDE 


In order to find a solution to the PDE above, we use a exponential transformation of the value 
function: 

V, =—Alog,. 
Given this logarithmic transformation, the partial derivatives of the value function with respect to 


time and state are expressed as follows: 


1 
OV; = hag En 


1 
VxV; = hag Vat ’ 


1 1 
ViuxV; = haga Va Vx — hag Vax Yi- 
Inserting the logarithmic transformation and the derivatives of the value function we obtain: 


À À a 
py oes = Gt — F (VY) f, San: 


1 
— 5(Vx¥,)"G/R7'G/ (Vx¥;,) + strace (T), (6) 
t 2Y; 2 





where the term T is expressed as: 
1 T 1 T 
r = hepa VY, — Ap Vex? G,X_G; . 
The trace of T is therefore: 


1 1 
trace (T) = Mggtrace (Vx) G:LeG;VxP;,) — hag trace (Vix¥:G;LeG/ ) . (7) 
t 
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Comparing the underlined terms in (6) and (7), one can recognize that these terms will cancel 
under the assumption of AR~! = Xg, which implies the simplification: 


AG,R-'G! =G,LeG! = X(x) = £. (8) 


The intuition behind this assumption (cf. also Kappen, 2007; Broek et al., 2008) is that, since the 

weight control matrix R is inverse proportional to the variance of the noise, a high variance control 

input implies cheap control cost, while small variance control inputs have high control cost. From 

a control theoretic stand point such a relationship makes sense due to the fact that under a large 

disturbance (= high variance) significant control authority is required to bring the system back to a 

desirable state. This control authority can be achieved with corresponding low control cost in R. 
With this simplification, (6) reduces to the following form 


1 1 
—0,; = — sa +f) (Vx ¥,) + ztrace ((Vix'¥,)G,LeG/ ) , (9) 


with boundary condition: Y, = exp (—£0,). The partial differential equation (PDE) in (9) corre- 
sponds to the so called Chapman Kolmogorov PDE, which is of second order and linear. Analytical 
solutions of (9) cannot be found in general for general nonlinear systems and cost functions. How- 
ever, there is a connection between solutions of PDEs and their representation as stochastic differ- 
ential equation (SDEs), that is mathematically expressed by the Feynman-Kac formula (Oksendal, 
2003; Yong, 1997). The Feynman-Kac formula (see appendix B) can be used to find distributions 
of random processes which solve certain SDEs as well as to propose numerical methods for solving 
certain PDEs. Applying the Feynman-Kac theorem, the solution of (9) is: 


N1 1 1 tN 
YP, = Ex, (Yue ie tade) — Ex, exp (-7 = S dt ar) . (10) 
ti 


Thus, we have transformed our stochastic optimal control problem into the approximation prob- 
lem of a path integral. With a view towards a discrete time approximation, which will be needed for 
numerical implementations, the solution (10) can be formulated as: 


l 1 N-1 
Y, = a p (Ti|X;) exp -+ (o + 2 wt) dti, (11) 
where 7; = (X;,,.....,Xzy) is a sample path (or trajectory piece) starting at state x,, and the term 


p(t;|x;) is the probability of sample path T; conditioned on the start state x,,. Since Equation (11) 
provides the exponential cost to go ¥, in state x,,, the integration above is taken with respect to 
sample paths T; = (X;,,Xr,,,5-+---»Xry)- The differential term dt; is defined as dt; = (dX;,,.....,dX1y ). 
Evaluation of the stochastic integral in (11) requires the specification of p (1;|x;), which is the topic 
of our analysis in the next section. 


2.3 Generalized Path Integral Formulation 


To develop our algorithms, we will need to consider a more general development of the path integral 
approach to stochastic optimal control than presented in Kappen (2007) and Broek et al. (2008). In 
particular, we have to address that in many stochastic dynamical systems, the control transition 
matrix G; is state dependent and its structure depends on the partition of the state in directly and 
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non-directly actuated parts. Since only some of the states are directly controlled, the state vector 
as T T . : 

is partitioned into x = [x x" ]? with x © KR! the non-directly actuated part and x € 
RK’! the directly actuated part. Subsequently, the passive dynamics term and the control transition 


T (oT T 
matrix can be partitioned as f, = fe!” f° |? with fn € KR !, fe € KR! and G, = (0; p GO JE 


with Go € R!*P, The discretized state space representation of such systems is given as: 
Xni =X, +£,dt + Gr, (udt F vate, ) , 


or, in partitioned vector form: 


(m) (m) (m) 
Xi. \) _ [ X f; Okxp 
sa = ao |+| #a «|| pee L | (u,dt + Vdte,, ). (12) 
(3 | (ib ) (o) (G5) ') 


Essentially the stochastic dynamics are partitioned into controlled equations in which the state 


x() is directly actuated and the uncontrolled equations in which the state x”) is not directly actu- 


ated. Since stochasticity is only added in the directly actuated terms (c) of (12), we can develop 
p(t;|x;) as follows. 
P(TiX;,) = P(Ti+1]X) 
=~ Rata) 
= TSP (Xi IX), 
where we exploited the fact that the start state x,, of a trajectory is given and does not contribute 


to its probability. For systems where the control has lower dimensionality than the state (12), the 
transition probabilities p (xi; leg |x;,) are factorized as follows: 


P (Xr Xr) = P (x) Ix) `P (xix, ) 
Slee L ) 


aa (xy) : (13) 


= 





tj 


where we have used the fact that p Ge) is the Dirac delta function, since x 


computed deterministically from x4” ; rake For all practical purposes,” the transition probability of 
the stochastic dynamics is reduced to the transition probability of the directly actuated part of the 


state: 


(m) can be 
+ 


tj+1 


p (tilk) = Tp (yal) TBSP (x) (14) 


Since we assume that the noise € is zero mean Gaussian distributed with variance Xg, where 
Le € K!*!, the transition probability of the directly actuated part of the state is defined as: 








(c) E 1 ERORO ON 
p (alx) = zyel CERON 


2. The delta functions will all integrate to 1 in the path integral. 
3. For notational simplicity, we write weighted square norms (or Mahalanobis distances) as v?” Mv = ||v||ĝņ. 
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; AT 
where the covariance },, € KR! is expressed as È, = Gi re)” dt. Combining (15) and (14) 
1 


results in the probability of a path expressed as: 
1 N-1 2 
ox exp Ł | ) 
= 1/2 
THY}! ((2m)! 121)" ( 2 ja z; 


Finally, we incorporate the assumption (8) about the relation between the control cost and the vari- 





P (Ti|X;,) 





xf), x0 — far | 


tj 





: T 
ance of the noise, which needs to be adjusted to the controlled space as £,, = Gi re)” dt = 


T T 
AGOR IGO dt = AH,,dt with H,, = G R IGN . Thus, we obtain: 


tj tj 











P (tilx;) ices 1/2 exp 2) 


1 Ee | x 
TG (27) lE) j=i || 


With this formulation of the probability of a trajectory, we can rewrite the the path integral (11) 

















as: 
1 N-1 1yN-1 x a (c) : 
exp | —3 | Ow + Lia Gat +5 Li a a =k a 
H; 
Y, = lim a 
ti di0 IEG; ((27)"/2£; 112) 


; 1 1 orn. \ ale) 
lim | aay? (750) dt; (16) 


where, we defined 


1 Nz! || x®© (c) || 


N-1 _x 
= tin T Xt lc) 
S(t;) = Dry + L q:,dt + 5 L j = ig 

















and 
D(t) = TN} (2z) l 
Note that the integration is over dv z (ax, oats AR): as the non-directly actuated states 


can be integrated out due to the fact that the state transition of the non-directly actuated states is 
deterministic, and just added Dirac delta functions in the integral (cf. Equation (13)). Equation (16) 
is written in a more compact form as: 


ue Date \\ ae) 
Ya = ym, fem (—58(0) lute) Jari 


= lim | exp (7.200) Jaw? (17) 


dt—0 


where Z(t;) = S(t;) +Alog D(t;). It can be shown that this term is factorized in path dependent 
and path independent terms of the form: 
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ss MN —i)l 
Z(G) Say) + ANEU log (27dtàÀ), 
where $(t;) = S(t;) + Ay log |H,,|. This formula is a required step for the derivation of 
optimal controls in the next section. The constant term MNO og (2mdtX) can be the source of 


numerical instabilities especially in cases where fine discretization dt of stochastic dynamics is 
required. However, in the next section, and in a great detail in Appendix A, lemma 1, we show how 
this term drops out of the equations. 


2.4 Optimal Controls 


For every moment of time, the optimal controls are given as u; = —-R'G/ (Vx, V,,). Due to the 
exponential transformation of the value function, the equation of the optimal controls can be written 


as ae 
u, = ARG, SË, 
ti ti Y, 


After substituting Y, with (17) and canceling the state independent terms of the cost we have: 


V0 ( eS ani”) 
u, = lim | AR7'G7 — = 
dt0 i if e~is lT:) dv 





Further analysis of the equation above leads to a simplified version for the optimal controls as 

















u, = f P (ti) wn (ti) at,” | (18) 
with the probability P (t;) and local controls uz (t;) defined as 
-15(T;) 
Pa ek g 
P (ti) = fear, (19) 











uz (Ti) = -R'G,)7 lim (vos) ; 


The path cost $(t;) is a generalized version of the path cost in Kappen (2005a) and Kappen (2007), 
which only considered systems with state independent control transition G,,. To find the local 
controls uz (t;) we have to calculate the limao Vos (ti). Appendix A and more precisely lemma 


2 shows in detail the derivation of the final result: 


lim (v no) = -H;' (Gie, a bi.) ’ 


dt—0 Xt; 


where the new term b,, is expressed as b, = AH, ®, and ®,, € KR’! is a vector with the j” element 


defined as: 
1 = 
(@,,) = zirace (1, : (2x0, )) ; 


4. More precisely if GO = G(® then the term 1 PS log |H;,| disappears since it is state independent and it appears in 





both nominator and denominator in (19). In this case, the path cost is reduced to $(t;) = S(t;). 
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The local control can now be expressed as: 
u(t) =R GOTH! (Gie, —by,) l 


By substituting H, = GOR IGT in the equation above, we get our main result for the local 
controls of the sampled path for the generalized path integral formulation: 





c c -1 c 
uz(t;) =R GT (G! RIG! 2) (ci e, —b,,) (20) 











The equations in boxes (18), (19) and (20) form the solution for the generalized path integral 
stochastic optimal control problem. Given that this result is of general value and constitutes the 
foundation to derive our reinforcement learning algorithm in the next section, but also since many 
other special cases can be derived from it, we summarized all relevant equations in Table 1. 

The Given components of Table 1 include a model of the system dynamics, the cost function, 
knowledge of the system’s noise process, and a mechanism to generate trajectories T;. It is important 
to realize that this is a model-based approach, as the computations of the optimal controls requires 
knowledge of €;. €; can be obtained in two ways. First, the trajectories T; can be generated purely 
in simulation, where the noise is generated from a random number generator. Second, trajectories 
could be generated by a real system, and the noise €; would be computed from the difference be- 
tween the actual and the predicted system behavior, that is, Ge, =X, — Š; = š; — (fn + Gru; ). 
Computing the prediction X;, also requires a model of the system dynamics. 

Previous results in Kappen (2005a), Kappen (2007), Kappen (2005b) and Broek et al. (2008) 
are special cases of our generalized formulation. In the next section we show how our generalized 
formulation is specialized to different classes of stochastic dynamical systems and we provide the 
corresponding formula of local controls for each class. 


2.5 Special Cases 


The purpose of this section is twofold. First, it demonstrates how to apply the path integral approach 
to specialized forms of dynamical systems, and how the local controls in (20) simplify for these 
cases. Second, this section prepares the special case which we will need for our reinforcement 
learning algorithm in Section 3. 


2.5.1 SYSTEMS WITH ONE DIMENSIONAL DIRECTLY ACTUATED STATE 


The generalized formulation of stochastic optimal control with path integrals in Table 1 can be 
applied to a variety of stochastic dynamical systems with different types of control transition matri- 
ces. One case of particular interest is where the dimensionality of the directly actuated part of the 
state is 1D, while the dimensionality of the control vector is 1D or higher dimensional. As will be 
seen below, this situation arises when the controls are generated by a linearly parameterized func- 
tion approximator. The control transition matrix thus becomes a row vector Gg? = gor E RISP, 
According to (20), the local controls for such systems are expressed as follows: 


R-! (c) Š 
uz (ti) = 5 (s a —by,) i 
8; R- 8; 
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e Given: 


— The system dynamics x, = f; + G; (u; + £+) (cf. 3) 

— The immediate cost r, = qr + su; Ru, (cf. 4) 

— A terminal cost term p; (cf. 1) 

— The variance X¢ of the mean-zero noise € 

— Trajectory starting at t; and ending at ty: Ti = (X;,,...--,Xry) 

- A partitioning of the system dynamics into (c) controlled and (m) uncontrolled equa- 


tions, where n = c +m is the dimensionality of the state x, (cf. Section 2.3) 


e Optimal Controls: 


Optimal controls at every time step t;: u; = f P(t;) u(t) dv 


(T;) 


ek 
~ _laTt) 
f e ran, 


Probability of a trajectory: P (t;) = 


Generalized trajectory cost: §(t;) = S(t;) + 2o log |H;,| where 


© _ 0 2 
Xega ži _ £9) 


* S(T) = iy + ye qrjadt + a di tj 














JT 
* H, = GORGO 
-1 
-— Local Controls: uz (t;) = RIGT (cP RGP") (Gie, = bi.) where 
* b; = AH,,®;, 


x [®,,] ; = strace (a (0,4) 


Table 1: Summary of optimal control derived from the path integral formalizm. 





Since the directly actuated part of the state is 1D, the vector x) collapses into the scalar x) 


which appears in the partial differentiation above. In the case that gO does not depend on x”, the 


differentiation with respect to x) results to zero and the the local controls simplify to: 
Rog, 

uz (Ti) = (\T ; 20 ti 

$i; R- S; 

2.5.2 SYSTEMS WITH PARTIALLY ACTUATED STATE 


The generalized formula of the local controls (20) was derived for the case where the control transi- 


tion matrix is state dependent and its dimensionality is Gg? € K!*? with Z < n and p the dimension- 


ality of the control. There are many special cases of stochastic dynamical systems in optimal control 
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and robotic applications that belong into this general class. More precisely, for systems having a 
state dependent control transition matrix that is square (Ge € &!*! with Z = p) the local controls 
based on (20) are reformulated as: 


7 —1 
uz(t;) =£, -G by. 21) 


Interestingly, a rather general class of mechanical systems such as rigid-body and multi-body 
dynamics falls into this category. When these mechanical systems are expressed in state space 
formulation, the control transition matrix is equal to rigid body inertia matrix G6? = M(8,,) (Sci- 
avicco and Siciliano, 2000). Future work will address this special topic of path integral control for 
multi-body dynamics. 

Another special case of systems with partially actuated state is when the control transition matrix 


is state independent and has dimensionality Gg = G\ € 8!*?. The local controls, according to 
(20), become: 


-1 
uz(t;) =R GO” (GOR GO") Ge, (22) 
If 6? is square and state independent, Gg = G60 €R!*!, we will have: 
uz(T;) = &,. (23) 


This special case was explored in Kappen (2005a), Kappen (2007), Kappen (2005b) and Broek 
et al. (2008). Our generalized formulation allows a broader application of path integral control 
in areas like robotics and other control systems, where the control transition matrix is typically 
partitioned into directly and non-directly actuated states, and typically also state dependent. 


2.5.3 SYSTEMS WITH FULLY ACTUATED STATE SPACE 


In this class of stochastic systems, the control transition matrix is not partitioned and, therefore, the 


control u directly affects all the states. The local controls for such systems are provided by simply 


substituting Go Ee R”XP in (20) with G, € R”*”. Since G, is a square matrix we obtain: 


uz (tT) = Er; — G; 'b;, 
with b; = AH, P, and 
1 _ 
(,); = trace (H, (dx, H) ) ! 


where the differentiation is not taken with respect to (x!) j but with respect to the full state (x; )j. 
For this fully actuated state space, there are subclasses of dynamical systems with square and/or 
state independent control transition matrix. The local controls for these cases are found by just 


substituting Gg? with G; in (21), (22) and (23). 


3. Reinforcement Learning with Parameterized Policies 


Equipped with the theoretical framework of stochastic optimal control with path integrals, we can 
now turn to its application to reinforcement learning with parameterized policies. Since the be- 
ginning of actor-critic algorithms (Barto et al., 1983), one goal of reinforcement learning has been 
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to learn compact policy representations, for example, with neural networks as in the early days of 
machine learning (Miller et al., 1990), or with general parameterizations (Peters, 2007; Deisenroth 
et al., 2009). Parameterized policies have much fewer parameters than the classical time-indexed 
approach of optimal control, where every time step has it own set of parameters, that is, the optimal 
controls at this time step. Usually, function approximation techniques are used to represent the op- 
timal controls and the open parameters of the function approximator become the policy parameters. 
Function approximators use a state representation as input and not an explicit time dependent rep- 
resentation. This representation allows generalization across states and promises to achieve better 
generalization of the control policy to a larger state space, such that policies become re-usable and 
do not have to be recomputed in every new situation. 

The path integral approach from the previous sections also follows the classical time-based 
optimal control strategy, as can be seen from the time dependent solution for optimal controls in 
(33). However, a minor re-interpretation of the approach and some small mathematical adjustments 
allow us to carry it over to parameterized policies and reinforcement learning, which results in a 
new algorithm called Policy Improvement with Path Integrals (PI’). 


3.1 Parameterized Policies 


We are focusing on direct policy learning, where the parameters of the policy are adjusted by a 
learning rule directly, and not indirectly as in value function approaches of classical reinforcement 
learning (Sutton and Barto, 1998)—-see Peters (2007) for a discussion of pros and cons of direct 
vs. indirect policy learning. Direct policy learning usually assumes a general cost function (Sutton 
et al., 2000; Peters, 2007) in the form of 


76) = JPR), (24) 


which is optimized over states-action trajectories’ To = (Xn, ar, -+-) Xry )- Under the first order Markov 
property, the probability of a trajectory is 


p(t) = P(X, | PE [Xaar )p (an |X). 


Both the state transition and the policy are assumed to be stochastic. The particular formulation 
of the stochastic policy is a design parameter, motivated by the application domain, analytical con- 
venience, and the need to inject exploration during learning. For continuous state action domains, 
Gaussian distributions are most commonly chosen (Gullapalli, 1990; Williams, 1992; Peters, 2007). 
An interesting generalized stochastic policy was suggested in Rueckstiess et al. (2008) and applied 
in Koeber and Peters (2008), where the stochastic policy p(a;,|x;,) is linearly parameterized as: 


a, = g} (0+€;), (25) 


with g, denoting a vector of basis functions and 9 the parameter vector. This policy has state de- 
pendent noise, which can contribute to faster learning as the signal-to-noise ratio becomes adaptive 
since it is a function of g;,. It should be noted that a standard additive-noise policy can be expressed 
in this formulation, too, by choosing one basis function (g;,) ; = 0. For Gaussian noise € the proba- 
bility of an action is p(a,,|x;,) =N (0"g,,,2;,) with X, = g7 Xeg,,. Comparing the policy formulation 





5. We use a; to denote actions here in order to avoid using the symbol u in a conflicting way in the equations below, and 
to emphasize that an action does not necessarily coincide with the control command to a physical system. 
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in (25) with the control term in (3), one recognizes that the control policy formulation (25) should 
fit into the framework of path integral optimal control. 


3.2 Generalized Parameterized Policies 


Before going into more detail of our proposed reinforcement learning algorithm, it is worthwhile 
contemplating what the action a; actually represents. In many applications of stochastic optimal 
control there are three main problems to be considered: i) trajectory planning, ii) feedforward con- 
trol, and iii) feedback control. The results of optimization could thus be an optimal kinematic 
trajectory, the corresponding feedforward commands to track the desired trajectory accurately in 
face of the system’s nonlinearities, and/or time varying linear feedback gains (gain scheduling) for 
a negative feedback controller that compensates for perturbations from accurate trajectory tracking. 

There are very few optimal control algorithms which compute all three issues simultaneously, 
such as Differential Dynamic Programming(DDP) (Jacobson and Mayne, 1970), or its simpler ver- 
sion the Iterative Linear Quadratic Regulator@LQR) (Todorov, 2005). However, these are model 
based methods which require rather accurate knowledge of the dynamics and make restrictive as- 
sumptions concerning differentiability of the system dynamics and the cost function. 

Path integral optimal control allows more flexibility than these related methods. The concept of 
an “action” can be viewed in a broader sense. Essentially, we consider any “input” to the control 
system as an action, not unlike the inputs to a transfer function in classical linear control theory. 
The input can be a motor command, but it can also be anything else, for instance, a desired state, 
that is subsequently converted to a motor command by some tracking controller, or a control gain 
(Buchli et al., 2010) . As an example, consider a robotic system with rigid body dynamics (RBD) 
equations (Sciavicco and Siciliano, 2000) using a parameterized policy: 


ä = M(q) '(—C(q,4) —v(q)) +M(q) ‘u, (26) 
= G(q)(9+é,), (27) 


where M is the RBD inertia matrix, C are Coriolis and centripetal forces, and v denotes gravity 
forces. The state of the robot is described by the joint angles q and joint velocities q. The policy 
(27) is linearly parameterized by 8, with basis function matrix G—one would assume that the di- 
mensionality of O is significantly larger than that of q to assure sufficient expressive power of this 
parameterized policy. Inserting (27) into (26) results in a differential equation that is compatible 
with the system equations (3) for path integral optimal control: 


ä = f(4,4)+G(a)(0+¢;,) (28) 
where 
Gq) = M(q) 'G(q). 


This example is a typical example where the policy directly represents motor commands. 
Alternatively, we could create another form of control structure for the RBD system: 


ä = M(q) '(—C(q,4) —v(q)) +M(q) ‘un, 


Kp(qa— q) +Kp(qa— 4), 
G(qa,qa)(8+ &,). (29) 


u 


Ga 
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Here, a Proportional-Derivative (PD) controller with positive definite gain matrices Kp and Kp 
converts a desired trajectory q4, Q4 into a motor command u. In contrast to the previous example, 
the parameterized policy generates the desired trajectory in (29), and the differential equation for 
the desired trajectory is compatible with the path integral formalism. 

What we would like to emphasize is that the control system’s structure is left to the creativity 
of its designer, and that path integral optimal control can be applied on various levels. Importantly, 
as developed in Section 2.3, only the controlled differential equations of the entire control system 
contribute to the path integral formalism, that is, (28) in the first example, or (29) in the second 
example. And only these controlled differential equations need to be known for applying path 
integral optimal control—none of the variables of the uncontrolled equations is ever used. 

At this point, we make a very important transition from model-based to model-free learning. 
In the example of (28), the dynamics model of the control system needs to be known to apply 
path integral optimal control, as this is a controlled differential equation. In contrast, in (29), the 
system dynamics are in an uncontrolled differential equation, and are thus irrelevant for applying 
path integral optimal control. In this case, only knowledge of the desired trajectory dynamics is 
needed, which is usually created by the system designer. Thus, we obtained a model-free learning 
system. 


3.3 Dynamic Movement Primitives as Generalized Policies 


As we are interested in model-free learning, we follow the control structure of the 2”! example of 
the previous section, that is, we optimize control policies which represent desired trajectories. We 
use Dynamic Movement Primitives (DMPs) ([jspeert et al., 2003) as a special case of parameterized 
policies, which are expressed by the differential equations: 


1 
z“ = fi+9; (0+8), (30) 


1. 
Sais a Lbs 


=X; = —OX;, 
fi = (belg y) —2). 


Essentially, these policies code a learnable point attractor for a movement from y,, to the goal 
g, where O determines the shape of the attractor. y,,y, denote the position and velocity of the 
trajectory, while z,,x; are internal states. ,,B,,1 are time constants. The basis functions g; € 
R?*! are defined by a piecewise linear function approximator with Gaussian weighting kernels, as 
suggested in Schaal and Atkeson (1998): 


Wj Xt 
P 
Èz- Wk 


w; = exp (—0.5h;(x;—c;)”), (31) 





[s]; = (g—yo), 


with bandwith h; and center c; of the Gaussian kernels—for more details see Ijspeert et al. (2003). 
The DMP representation is advantageous as it guarantees attractor properties towards the goal while 
remaining linear in the parameters © of the function approximator. By varying the parameter 6 the 
shape of the trajectory changes while the goal state g and initial state y,, remain fixed. These 
properties facilitate learning (Peters and Schaal, 2008a). 
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3.4 Policy Improvements with Path Integrals: The (PI?) Algorithm 


As can be easily recognized, the DMP equations are of the form of our control system (3), with only 
one controlled equation and a one dimensional actuated state. This case has been treated in Section 
2.5.1. The motor commands are replaced with the parameters 0—the issue of time dependent vs. 
constant parameters will be addressed below. More precisely, the DMP equations can be written as: 


Xt — OX} O1xp 
4 = Yt T Oixp (0; +€;). 
ve a,(Be(g — yr) — z) gr 


The state of the DMP is partitioned into the controlled part x = y; and uncontrolled part 


x™ = (x z)”. The control transition matrix depends on the state, however, it depends only on one 
of the state variables of the uncontrolled part of the state, that is, x. The path cost for the stochastic 


dynamics of the DMPs is given by: 


























N-1 1 N=! || x -x0 T The N=! 
ç 2 j+ j c 
S(t) = wt by qjdt +5 $ | T f | _dt+5 3 log |H; | 
j=i j= || IE j=i 
N-1 1 N=! j 2 
~ wt Yay ts E e Oe 
j=i j=i tj 
N-1 N-1 
1 1 7 T 
= Owt L +5 L 5 (6, e) gH, gO (Or; +€:,) 
j=i j=i 
N-1 N-1 (c) (9T 
1 1 r Se Bt; 
= Oz T dt; (0;, E) z 2 i 7 (0;, f E) 
s L 7 7k ed gOTR- gO Lorn) 
N-1 1X21] sa 
a ty t L dt; 2 py 5 (8; Er) M; ,RM,,(9;, +E). (32) 
j=i j=i 
: Rg) g (TF p-1 4) : 
with M,, = ka H; becomes a scalar given by H; = g; R` g; °. Interestingly, the term 
tj tj 


A nes log |H;,| for the case of DMPs depends only on x;, which is a deterministic variable and 


therefore can be ignored since it is the same for all sampled paths. We also absorbed, without 
loss of generality, the time step dt in cost terms. Consequently, the fundamental result of the path 
integral stochastic optimal problem for the case of DMPs is expressed as: 





u, = f P(t) uz (ti) d1 |, (33) 











where the probability P(t;) and local controls u(t;) are defined as 


—15(T;) Roa) pT 
ek Sr Sr 
P(t) = ~ hit a.” uz(T;) = m ti 
Je z5 Ddr; g; Rig; 
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and the path cost given as 


N-1 N-1 

P 1 

S(t) =Ow+ Ved ts eT M] RM, &;,. 
J5! j=l 


Note that 6 = 0 in these equations, that is, the parameters are initialized to zero. These equations 
correspond to the case where the stochastic optimal control problem is solved with one evaluation 
of the optimal controls (33) using dense sampling of the whole state space under the “passive dy- 
namics” (i.e., © = 0), which requires a significant amount of exploration noise. Such an approach 
was pursued in the original work by Kappen (2007) and Broek et al. (2008), where a potentially 
large number of sample trajectories was needed to achieve good results. Extending this sampling 
approach to high dimensional spaces, however, is daunting, as with very high probability, we would 
sample primarily rather useless trajectories. Thus, biasing sampling towards good initial conditions 
seems to be mandatory for high dimensional applications. 

Thus, we consider only local sampling and an iterative update procedure. Given a current guess 
of O, we generate sample roll-outs using stochastic parameters 0 + £; at every time step. To see how 
the generalized path integral formulation is modified for the case of iterative updating, we start with 
the equations of the update of the parameter vector 8, which can be written as: 











—1 T 
(new) __ R gygy (O +E) 
0; = f (Ti) g, TR !g, dt; 
_— [Po R88 En ie, \ R` !g,g,10 
i g, R-'g,, a g, R Ig, 
R` !g,g,T 
— 80, 21,81; 9 
' "trace (R7'!g,,g/,") 
= 60, +M,9. (34) 
R! Sr; St; TE; 


The correction parameter vector 50,, is defined as 0, = fP (t;) dt;. It is important to 


g; Rog, 
note that er") is now time dependent, that is, for every time step ¢;, a different optimal parameter 
vector is computed. In order to return to one single time independent parameter vector oew), the 

(new) ; 
vectors ©, — need to be averaged over time fj. 
We start with a first tentative suggestion of averaging over time, and then explain why it is 
inappropriate, and what the correct way of time averaging has to look like. The tentative and most 
intuitive time average is: 


ere = — ) OF’ =— ) 0, +— ) M,9. 
N i= N i= N i= 
Thus, we would update O based on two terms. The first term is the average of 50,,, which is reason- 
able as it reflects the knowledge we gained from the exploration noise. However, there would be a 
second update term due to the average over projected mean parameters O from every time step—it 
should be noted that M, is a projection matrix onto the range space of g,, under the metric R~!, such 
that a multiplication with M, can only shrink the norm of 8. From the viewpoint of having optimal 
parameters for every time step, this update component is reasonable as it trivially eliminates the part 
of the parameter vector that lies in the null space of g;, and which contributes to the command cost 
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of a trajectory in a useless way. From the view point of a parameter vector that is constant and time 
independent and that is updated iteratively, this second update is undesirable, as the multiplication 
of the parameter vector O with M; in (34) and the averaging operation over the time horizon reduces 
the L norm of the parameters at every iteration, potentially in an uncontrolled way. What we 
rather want is to achieve convergence when the average of 50,, becomes zero, and we do not want 
to continue updating due to the second term. 

The problem is avoided by eliminating the projection matrix in the second term of averaging, 


such that it become: 
N-1 1 Ne! 1 Nz! 


at) = 5 E it L o= y 2 8, +0 


The meaning of this reduced update is simply that we keep a component in 6 that is irrelevant and 
contributes to our trajectory cost in a useless way. However, this irrelevant component will not 
prevent us from reaching the optimal effective solution, that is, the solution that lies in the range 
space of g. Given this modified update, it is, however, also necessary to derive a compatible cost 
function. As mentioned before, in the unmodified scenario, the last term of (32) is: 


N— 
Z (0 +e:,)" M; RM; (0 +e) 


as 


To avoid a projection of 8, we modify this cost term to be: 


1 N=! 
= SX (0+M,,€,,)’R(@+M,<€;,). 
255 i 


With this modified cost term, the path integral formalism results in the desired ane") without the 
M,, projection of 8. 

The main equations of the iterative version of the generalized path integral formulation, called 
Policy Improvement with Path Integrals (PI’), can be summarized as: 





eo 5(T) rx 
Pa) = 
N-1 1 N-1 7 
S (ti) = Ory + >: qdt + 7 > (0 +M,,€;;) R(8 +M,,€;,)dt, (36) 
j=i j=i 
ô0;, — [PC Mieidti (37) 
N-1 5 
~ 9 (N — i) wjn [88%] ; 
poy = Ea Ni wi Pod, 8) 
Li- =0 Wj, 1,(N — i) 


gew) = g(2!4) +80. 


Essentially, (35) computes a discrete probability at time t; of each trajectory roll-out with the help 
of the cost (36). For every time step of the trajectory, a parameter update is computed in (37) based 





6. To be precise, 8 would be projected and continue shrinking until it lies in the intersection of all null spaces of the g; 
basis function—this null space can easily be of measure zero. 
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on a probability weighted average over trajectories. The parameter updates at every time step are 
finally averaged in (38). Note that we chose a weighted average by giving every parameter update a 
weight’ according to the time steps left in the trajectory and the activation of the kernel in (31). This 
average can be interpreted as using a function approximator with only a constant (offset) parameter 
vector to approximate the time dependent parameters. Giving early points in the trajectory a higher 
weight is useful since their parameters affect a large time horizon and thus higher trajectory costs. 
Other function approximation (or averaging) schemes could be used to arrive at a final parameter 
update—we preferred this simple approach as it gave very good learning results. The final parameter 
update is 0") = (4) + 80. 

The parameter A regulates the sensitivity of the exponentiated cost and can automatically be 
optimized for every time step i to maximally discriminate between the experienced trajectories. 
More precisely, a constant term can be subtracted from (36) as long as all S(t;) remain positive—this 
constant term® cancels in (35). Thus, for a given number of roll-outs, we compute the exponential 


term in (35) as 
Hie S(t) — min S(t;) 
exp (-75@)) ie ( ia CO a 





with h set to a constant, which we chose to be h = 10 in all our evaluations. The max and min 
operators are over all sample roll-outs. This procedure eliminates À and leaves the variance of the 
exploration noise £ as the only open algorithmic parameter for PI’. It should be noted that the 
equations for PI’ have no numerical pitfalls: no matrix inversions and no learning rates,” rendering 
PI’ to be very easy to use in practice. 

The pseudocode for the final PI? algorithm for a one dimensional control system with function 
approximation is given in Table 2. A tutorial Matlab example of applying PI? can be found at 
http://www-clmc.usc.edu/software . 


4. Related Work 


In the next sections we discuss related work in the areas of stochastic optimal control and rein- 
forcement learning and analyze the connections and differences with the PI’ algorithm and the 
generalized path integral control formulation. 


4.1 Stochastic Optimal Control and Path Integrals 


The path integral formalism for optimal control was introduced in Kappen (2005a,b). In this work, 
the role of noise in symmetry breaking phenomena was investigated in the context of stochastic 
optimal control. In Kappen et al. (2007), Wiegerinck et al. (2006), and Broek et al. (2008), the path 
integral formalism is extended for the stochastic optimal control of multi-agent systems. 

Recent work on stochastic optimal control by Todorov (2008), Todorov (2007) and Todorov 
(2009b) shows that for a class of discrete stochastic optimal control problems, the Bellman equa- 





7. The use of the kernel weights in the basis functions (31) for the purpose of time averaging has shown better perfor- 
mance with respect to other weighting approaches, across all of our experiments. Therefore this is the weighting that 
we suggest. Users may develop other weighting schemes as more suitable to their needs. 

8. In fact, the term inside the exponent results by adding a er 
= hS(T:) 

max S(T;)—min S(T;) 

9. R is a user design parameter and usually chosen to be diagonal and invertible. 


which cancels in (35), to the term 


which is equal to — S(t). 
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e Given: 


— An immediate cost function r; = gq, + o7 RO, (cf. 1) 
— A terminal cost term Ọ;, (cf. 1) 


— A stochastic parameterized policy a; = g7 (0 + €,) (cf. 25) 


The basis function g;, from the system dynamics (cf. 3 and Section 2.5.1) 


The variance X¢ of the mean-zero noise €; 


The initial parameter vector 0 
e Repeat until convergence of the trajectory cost R: 


— Create K roll-outs of the system from the same start state xq using stochstic parameters 
0+ €, at every time step 


— Fork = 1...K, compute: 
-15(T;,) 
* P (tix) = Eee 
DE fe BT) 
N-1 15-1 

* S(Tik) =O a+ Xizi dtk t3 Xi (0+ Mz, k€; x) R(0 a M; kEr;k) 
Rg kg, 
x M; ; = —— 4 

tj,k gi R'E k 


- For i = 1...(N — 1), compute: 
* 60; = Ei [P (Tik) Mak Enx] 


Nl (N—i) wjz, [88;,]; 
— Compute [59] ; = a a di 





— Update 0+ 6+60 


— Create one noiseless roll-out to check the trajectory cost R = Qy + Ee ie rą. In case 
the noise cannot be turned off, that is, a stochastic system, multiple roll-outs need be 
averaged. 





Table 2: Pseudocode of the PI* algorithm for a 1D Parameterized Policy (Note that the discrete 
time step dt was absorbed as a constant multiplier in the cost terms). 


tion can be written as the KL divergence between the probability distribution of the controlled and 
uncontrolled dynamics. Furthermore it is shown that the class of discrete KL divergence control 
problem is equivalent to the continuous stochastic optimal control formalism with quadratic cost 
control function and under the presence of Gaussian noise. In Kappen et al. (2009), the KL diver- 
gence control formalism is considered and it is transformed to a probabilistic inference problem. 
In all this aforementioned work, both in the path integral formalism as well as in KL divergence 
control, the class of stochastic dynamical systems under consideration is rather restrictive since the 
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control transition matrix is state independent. Moreover, the connection to direct policy learning in 
RL and model-free learning was not made in any of the previous projects. 
Our PI’ algorithm differs with respect to the aforementioned work in the following points. 


In Todorov (2009b) the stochastic optimal control problem is investigated for discrete action 
- state spaces and therefore it is treated as Markov Decision Process (MDP). To apply our PI? 
algorithm, we do not discretize the state space and we do not treat the problem as an MDP. 
Instead we work in continuous state - action spaces which are suitable for performing RL in 
high dimensional robotic systems. To the best of our knowledge, our results present RL in 
one of the most high dimensional continuous state action spaces. 


In our derivations, the probabilistic interpretation of control comes directly from the Feynman- 
Kac Lemma. Thus we do not have to impose any artificial “pseudo-probability“ treatment of 
the cost as in Todorov (2009b). In addition, for the continuous state - action spaces we do not 
have to learn the value function as it is suggested in Todorov (2009b) via Z-learning. Instead 
we directly find the controls based on our generalization of optimal controls. 


In the previous work, the problem of how to sample trajectories is not addressed. Sampling 
is performed at once with the hope to cover the all state space. We follow a rather different 
approach that allows to attack robotic learning problems of the complexity and dimensionality 
of the little dog robot. 


The work in Todorov (2009a) considers stochastic dynamics with state dependent control 
matrix. However, the way of how the stochastic optimal control problem is solved is by 
imposing strong assumptions on the structure of the cost function and, therefore, restrictions 
of the proposed solution to special cases of optimal control problems. The use of this specific 
cost function allows transforming the stochastic optimal control problem to a deterministic 
optimal control problem. Under this transformation, the stochastic optimal control problem 
can be solved by using deterministic algorithms. 


With respect to the work in Broek et al. (2008), Wiegerinck et al. (2006) and Kappen et al. 
(2009) our PI? algorithm has been derived for a rather general class of systems with control 
transition matrix that is state dependent. In this general class, Rigid body and multi-body 
dynamics as well as the DMPs are included. Furthermore we have shown how our results 
generalize previous work. 


4.2 Reinforcement Learning of Parameterized Policies 


There are two main classes of related algorithms: Policy Gradient algorithms and probabilistic 
algorithms. 

Policy Gradient algorithms (Peters and Schaal, 2006a,b) compute the gradient of the cost func- 
tion (24) at every iteration and the policy parameters are updated according to 6”) = 04) + 
AVJ. Some well-established algorithms, which we will also use for comparisons, are as follows 
(see also Peters and Schaal, 2006a,b). 


4.2.1 REINFORCE 


Williams (1992) introduced the episodic REINFORCE algorithm, which is derived from taking the 
derivative of (24) with respect to the policy parameters. This algorithm has rather slow convergence 
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due to a very noisy estimate of the policy gradient. It is also very sensitive to a reward baseline 
parameter bg (see below). Recent work derived the optimal baseline for REINFORCE (cf. Peters 
and Schaal, 2008a), which improved the performance significantly. The episodic REINFORCE 
update equations are: 


Vo, = Et 
i=0 





N-1 
(R(T) — by) y va nna) ’ 


Er, | (E25! Vo, In p(a;, x4 ))? R(t0)| 


Er, KE Vo In p(ar, |x;,)) A 





bg = 


where k denotes the k-th coefficient of the parameter vector and R(to) = > ye Fy Fi. 


4.2.2 GPOMDP AND THE POLICY GRADIENT THEOREM ALGORITHM 


In their GPOMDP algorithm, Baxter and Bartlett (2001) introduced several improvements over RE- 
INFORCE that made the gradient estimates more efficient. GPOMDP can also be derived from the 
policy gradient theorem (Sutton et al., 2000; Peters and Schaal, 2008a), and an optimal reward base- 
line can be added (cf. Peters and Schaal, 2008a). In our context, the GPOMDP learning algorithm 
can be written as: 


N-1 j 
VaJ = Eq E (HP) É (aot) 
0 


j=0 i= 





Er, (Vo, In p(a, Ix)? ri 


Er, |(Vo, Inp(aylx,))" 





p% 


ti 


4.2.3 THE EPISODIC NATURAL ACTOR CRITIC 


One of the most efficient policy gradient algorithm was introduced in Peters and Schaal (2008b), 
called the Episodic Natural Actor Critic. In essence, the method uses the Fisher Information Matrix 
to project the REINFORCE gradient onto a more effective update direction, which is motivated by 
the theory of natural gradients by Amari (1999). The eNAC algorithm takes the form of: 


Vo, In p(a;,[X;, 
Erik = | Ok A nl ti) h 
Vol N-1 z =l N-1 
[50] = euE Etal Eu [RC D Era 
i=0 i=0 











where Jo is a constant offset term. 


4.2.4 POWER 


The PoWER algorithm (Koeber and Peters, 2008) is a probabilistic policy improvement method, not 
a gradient algorithm. Itis derived from an Expectation-Maximization framework using probability 
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matching (Dayan and Hinton, 1997; Peters and Schaal, 2008c). Using the notation of this paper, the 
parameter update of POWER becomes: 




















€ 
50 = Ey Y R,— ggi ae 5 R; BB t 
i=0 gig, Sr tito gi Sr; 
where R, = yi j= Ty If we set R~! =c I in the update (37) of PI’, and set = a = Tin the matrix 


lj 

inversion term of the two algorithms look essentially identical. But it should be noted that 
the rewards r, in POWER need to behave like an improper probability, that is, be strictly positive 
and integrate to a constant number—this property can make the design of suitable cost functions 
more complicated. PL’, in contrast, uses exponentiated sum of reward terms, where the immedi- 
ate reward can be arbitrary, and only the cost on the motor commands needs be quadratic. Our 
empirical evaluations revealed that, for cost functions that share the same optimum in the POWER 
pseudo-probability formulation and the PI’ notation, both algorithms perform essentially identical, 
indicating that the matrix inversion term in POWER may be unimportant for many systems. It should 
be noted that in Vlassis et al. (2009), POWER was extended to the discounted infinite horizon case, 
where PoWER is the special case of a non-discounted finite horizon problem. 


5. Evaluations 


We evaluated PI? in several synthetic examples in comparison with REINFORCE, GPOMDP, 
eNAC, and, when possible, POWER. Except for POWER, all algorithms are suitable for optimiz- 
ing immediate reward functions of the kind r, = q; +u,Ru,. As mentioned above, POWER requires 
that the immediate reward behaves like an improper probability. This property is incompatible with 
r; = qr + u;Ru, and requires some special nonlinear transformations, which usually change the na- 
ture of the optimization problem, such that POWER optimizes a different cost function. Thus, only 
one of the examples below has a compatible a cost function for all algorithms, including POWER. In 
all examples below, exploration noise and, when applicable, learning rates, were tuned for every in- 
dividual algorithms to achieve the best possible numerically stable performance. Exploration noise 
was only added to the maximally activated basis function in a motor primitive,!° and the noise was 
kept constant for the entire time that this basis function had the highest activation—empirically, this 
tick helped improves the learning speed of all algorithms. 


5.1 Learning Optimal Performance of a 1 DOF Reaching Task 


The first evaluation considers learning optimal parameters for a 1 DOF DMP (cf. Equation 30). The 
immediate cost and terminal cost are, respectively: 


r, = 0.5f; +5000 070, P = 10000(97, + 10(g — yw”) 


with yn = 0 and g = 1—we use radians as units motivated by our interest in robotics application, 
but we could also avoid units entirely. The interpretation of this cost is that we would like to reach 
the goal g with high accuracy while minimizing the acceleration of the movement and while keeping 
the parameter vector short. Each algorithm was run for 15 trials to compute a parameter update, and 





10. That is, the noise vector in (25) has only one non-zero component. 
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a total of 1000 updates were performed. Note that 15 trials per update were chosen as the DMP 
had 10 basis functions, and the eNAC requires at least 11 trials to perform a numerically stable 
update due to its matrix inversion. The motor primitives were initialized to approximate a 5-th 
order polynomial as point-to-point movement (cf. Figure 1a,b), called a minimum-jerk trajectory 
in the motor control literature; the movement duration was 0.5 seconds, which is similar to normal 
human reaching movements. Gaussian noise of N(0,0.1) was added to the initial parameters of the 
movement primitives in order to have different initial conditions for every run of the algorithms. 
The results are given in Figure 1. Figure 1a,b show the initial (before learning) trajectory generated 
by the DMP together with the learning results of the four different algorithms after learning— 
essentially, all algorithms achieve the same result such that all trajectories lie on top of each other. 
In Figure 1c, however, it can be seen that PI’ outperforms the gradient algorithms by an order 
of magnitude. Figure 1d illustrates learning curves for the same task as in Figure 1c, just that 
parameter updates are computed already after two roll-outs—the eNAC was excluded from this 
evaluation as it would be too heuristic to stabilize its ill-conditioned matrix inversion that results 
from such few roll-outs. PI’ continues to converge much faster than the other algorithms even in 
this special scenario. However, there are some noticeable fluctuation after convergence. This noise 
around the convergence baseline is caused by using only two noisy roll-outs to continue updating 
the parameters, which causes continuous parameter fluctuations around the optimal parameters. 
Annealing the exploration noise, or just adding the optimal trajectory from the previous parameter 
update as one of the roll-outs for the next parameter update can alleviate this issue—we do not 
illustrate such little “tricks” in this paper as they really only affect fine tuning of the algorithm. 


5.2 Learning Optimal Performance of a 1 DOF Via-Point Task 


The second evaluation was identical to the first evaluation, just that the cost function now forced 
the movement to pass through an intermediate via-point at t = 300ms. This evaluation is an abstract 
approximation of hitting a target, for example, as in playing tennis, and requires a significant change 
in how the movement is performed relative to the initial trajectory (Figure 2a). The cost function 
was 

T300ms = 100000000 (G — Yno) Ory =O 


with G = 0.25. Only this single reward was given. For this cost function, the POWER algorithm 
can be applied, too, with cost function 7300ms = exp(—1/À r300ms) and 7, = 0 otherwise. This 
transformed cost function has the same optimum as r300ms. The resulting learning curves are given in 
Figure 2 and resemble the previous evaluation: PI? outperforms the gradient algorithms by roughly 
an order of magnitude, while all the gradient algorithms have almost identical learning curves. As 
was expected from the similarity of the update equations, POWER and PY’ have in this special case 
the same performance and are hardly distinguishable in Figure 2. Figure 2a demonstrates that all 
algorithms pass through the desired target G, but that there are remaining differences between the 
algorithms in how they approach the target G—these difference have a small numerical effect in 
the final cost (where PI? and PoWER have the lowest cost), but these difference are hardly task 
relevant. 


5.3 Learning Optimal Performance of a Multi-DOF Via-Point Task 


A third evaluation examined the scalability of our algorithms to a high-dimensional and highly 
redundant learning problem. Again, the learning task was to pass through an intermediate target G, 
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Figure 1: Comparison of reinforcement learning of an optimized movement with motor primitives. 
a) Position trajectories of the initial trajectory (before learning) and the results of all algo- 
rithms after learning—the different algorithms are essentially indistighuishable. b) The 
same as a), just using the velocity trajectories. c) Average learning curves for the differ- 
ent algorithms with 1 std error bars from averaging 10 runs for each of the algorithms. d) 
Learning curves for the different algorithms when only two roll-outs are used per update 
(note that the eNAC cannot work in this case and is omitted). 


just that a d = 2,10, or 50 dimensional motor primitive was employed. We assume that the multi- 
DOF systems model planar robot arms, where d links of equal length / = 1 /d are connected in an 
open chain with revolute joints. Essentially, these robots look like a multi-segment snake in a plane, 
where the tail of the snake is fixed at the origin of the 2D coordinate system, and the head of the 
snake can be moved in the 2D plane by changing the joint angles between all the links. Figure 3b,d,f 
illustrate the movement over time of these robots: the initial position of the robots is when all joint 
angles are zero and the robot arm completely coincides with the x-axis of the coordinate frame. 
The goal states of the motor primitives command each DOF to move to a joint angle, such that the 
entire robot configuration afterwards looks like a semi-circle where the most distal link of the robot 
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Figure 2: Comparison of reinforcement learning of an optimized movement with motor primitives 
for passing through an intermediate target G. a) Position trajectories of the initial trajec- 
tory (before learning) and the results of all algorithms after learning. b) Average learning 
curves for the different algorithms with 1 std error bars from averaging 10 runs for each 
of the algorithms. 


(the end-effector) touches the y-axis. The higher priority task, however, is to move the end-effector 
through a via-point G = (0.5,0.5). To formalize this task as a reinforcement learning problem, we 
denote the joint angles of the robots as &;, with i = 1,2,...,d, such that the first line of (30) reads 
now as Ee = fist g/ (0; + €;,)—this small change of notation is to avoid a clash of variables with 
the (x,y) task space of the robot. The end-effector position is computed as: 


1 d f 1 d i 
x=- } cos(} 6j) y=>} sin’). Eja). 
dix j=l dij j=l 


The immediate reward function for this problem is defined as 





ye (d+1—1) (0.1f2, +0.5 6/6;) 





= l 39 
i dsi- A 
AF300ms = 100000000 ((0.5 — Xrom)” + (0-5 — Yrsoom) J + 
ry = 0, 


where Ar300ms is added to 7, at time t = 300ms, that is, we would like to pass through the via- 
point at this time. The individual DOFs of the motor primitive were initialized as in the 1 DOF 
examples above. The cost term in (39) penalizes each DOF for using high accelerations and large 
parameter vectors, which is a critical component to achieve a good resolution of redundancy in the 
arm. Equation (39) also has a weighting term d + 1 — i that penalizes DOFs proximal to the orgin 
more than those that are distal to the origin—intuitively, applied to human arm movements, this 
would mean that wrist movements are cheaper than shoulder movements, which is motivated by the 
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fact that the wrist has much lower mass and inertia and is thus energetically more efficient to move. 


The results of this experiment are summarized in Figure 3. The learning curves in the left 
column demonstrate again that PI’ has an order of magnitude faster learning performance than the 
other algorithms, irrespective of the dimensionality. PI? also converges to the lowest cost in all 
examples: 


























Algorithm 2-DOFs 10-DOFs 50-DOFs 
PI 98000 + 5000 15700 + 1300 2800 + 150 
REINFORCE | 125000 + 2000 22000+700 19500 24000 
PG 128000 +2000 28000423000 27000 = 40000 
NAC 113000 + 10000 48000+8000 22000 + 2000 











Figure 3 also illustrates the path taken by the end-effector before and after learning. All algo- 
rithms manage to pass through the via-point G appropriately, although the path particularly before 
reaching the via-point can be quite different across the algorithms. Given that PI? reached the low- 
est cost with low variance in all examples, it appears to have found the best solution. We also added 
a “stroboscopic” sketch of the robot arm for the PI’ solution, which proceeds from the very right to 
the left as a function of time. It should be emphasized that there were absolutely no parameter tun- 
ing needed to achieve the PI’ results, while all gradient algorithms required readjusting of learning 
rates for every example to achieve best performance. 


5.4 Application to Robot Learning 


Figure 4 illustrates our application to a robot learning problem. The robot dog is to jump across as 
gap. The jump should make forward progress as much as possible, as it is a maneuver in a legged 
locomotion competition which scores the speed of the robot—note that we only used a physical 
simulator of the robot for this experiment, as the actual robot was not available. The robot has three 
DOFs per leg, and thus a total of d = 12 DOFs. Each DOF was represented as a DMP with 50 
basis functions. An initial seed behavior (Figure 5-top) was taught by learning from demonstration, 
which allowed the robot barely to reach the other side of the gap without falling into the gap—the 
demonstration was generated from a manual adjustment of spline nodes in a spline-based trajectory 
plan for each leg. 


PI’ learning used primarily the forward progress as a reward, and slightly penalized the squared 
acceleration of each DOF, and the length of the parameter vector. Additionally, a penalty was 
incurred if the yaw or the roll exceeded a threshold value—these penalties encouraged the robot to 
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Figure 3: Comparison of learning multi-DOF movements (2,10, and 50 DOFs) with planar robot 
arms passing through a via-point G. a,c,e) illustrate the learning curves for different RL 
algorithms, while b,d,f) illustrate the end-effector movement after learning for all algo- 
rithms. Additionally, b,d,f) also show the initial end-effector movement, before learning 
to pass through G, and a “stroboscopic” visualization of the arm movement for the final 
result of PI’ (the movements proceed in time starting at the very right and ending by 
(almost) touching the y axis). 
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Figure 4: Reinforcement learning of optimizing to jump over a gap with a robot dog. The improve- 
ment in cost corresponds to about 15 cm improvement in jump distance, which changed 
the robot’s behavior from an initial barely successful jump to jump that completely tra- 
versed the gap with entire body. This learned behavior allowed the robot to traverse a gap 
at much higher speed in a competition on learning locomotion. The experiments for this 
paper were conducted only on the robot simulator. 


jump straight forward and not to the side, and not to fall over. The exact cost function is: 


d 
ry = Trott +Tyaw t+). (ai fh +0.Saz 070) (a, = 1.e—6,a2 = 1.e —8), 
i=l 
100 * (|roll,|—0.3)*, if (|roll,| > 0.3) 
Troll = : 
0, otherwise, 
100 * (|yaw,|—0.1)?, if (|yaw,| > 0.1) 
Naw = y 
0, otherwise, 


Ory = 50000 (goal — Xnose)”, 


where roll, yaw are the roll and yaw angles of the robot’s body, and Xnose is the position of the front 
tip (the “nose”) of the robot in the forward direction, which is the direction towards the goal. The 
multipliers for each reward component were tuned to have a balanced influence of all terms. Ten 
learning trials were performed initially for the first parameter update. The best 5 trials were kept, and 
five additional new trials were performed for the second and all subsequent updates. Essentially, this 
method performs importance sampling, as the rewards for the 5 trials in memory were re-computed 
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Figure 5: Sequence of images from the simulated robot dog jumping over a 14cm gap. Top: before 
learning. Bottom: After learning. While the two sequences look quite similar at the first 
glance, it is apparent that in the 4th frame, the robot’s body is significantly higher in the 
air, such that after landing, the body of the dog made about 15cm more forward progress 
as before. In particular, the entire robot’s body comes to rest on the other side of the gap, 
which allows for an easy transition to walking. In contrast, before learning, the robot’s 
body (and its hind legs) are still on the right side of the gap, which does not allow for a 
successful continuation of walking. 


with the latest parameter vectors. A total of 100 trials was performed per run, and ten runs were 
collected for computing mean and standard deviations of learning curves. 

Figure 4 illustrates that after about 30 trials (i.e., 5 updates), the performance of the robot was 
converged and significantly improved, such that after the jump, almost the entire body was lying on 
the other side of the gap. Figure 4 captures the temporal performance in a sequence of snapshots of 
the robot. It should be noted that applying PI’ was algorithmically very simple, and manual tuning 
only focused on generated a good cost function, which is a different research topic beyond the scope 
of this paper. 


6. Discussion 


This paper derived a more general version of stochastic optimal control with path integrals, based 
on the original work by Kappen (2007) and Broek et al. (2008). The key results were presented in 
Table 1 and Section 2.5, which considered how to compute the optimal controls for a general class 
of stochastic control systems with state-dependent control transition matrix. One important class 
of these systems can be interpreted in the framework of reinforcement learning with parameterized 
policies. For this class, we derived Policy Improvement with Path Integrals (PI’) as a novel algo- 
rithm for learning a parameterized policy. PI’ inherits its sound foundation in first order principles 
of stochastic optimal control from the path integral formalism. It is a probabilistic learning method 
without open algorithmic tuning parameters, except for the exploration noise. In our evaluations, 
PI? outperformed gradient algorithms significantly. It is also numerically simpler and has easier 
cost function design than previous probabilistic RL methods that require that immediate rewards 
are pseudo-probabilities. The similarity of PI’ with algorithms based on probability matching indi- 
cates that the principle of probability matching seems to approximate a stochastic optimal control 
framework. Our evaluations demonstrated that PI’ can scale to high dimensional control systems, 
unlike many other reinforcement learning systems. 
Some issues, however, deserve more detailed discussions in the following paragraphs. 
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6.1 The Simplification AR~! = Xe 


In order to obtain linear 2”? order differential equations for the exponentially transformed HJB equa- 
tions, the simplification ART! = Lg was applied. Essentially, this assumption couples the control 
cost to the stochasticity of the system dynamics, that is, a control with high variance will have rela- 
tively small cost, while a control with low variance will have relatively high cost. This assumption 
makes intuitively sense as it would be mostly unreasonable to attribute a lot of cost to an unreliable 
control component. Algorithmically, this assumption transforms the Gaussian probability for state 
transitions into a quadratic command cost, which is exactly what our immediate reward function 
postulated. Future work may allow removing this simplification by applying generalized versions 
of the Feynman-Kac Lemma. 


6.2 Model-based, Hybrid, and Model-free Learning 


Stochastic optimal control with path integrals makes a strong link to the dynamic system to be 
optimized—indeed, originally, it was derived solely as model-based method. As this paper demon- 
strated, however, this view can be relaxed. The roll-outs, needed for computing the optimal controls, 
can be generated either from simulating a model, or by gathering experience from an actual system. 
In the latter case, only the control transition matrix of the model needs be known, such that we obtain 
a hybrid model-based/model-free method. In this paper, we even went further and interpreted the 
stochastic dynamic system as a parameterized control policy, such that no knowledge of the model 
of the control system was needed anymore—that is, we entered a model-free learning domain. It 
seems that there is a rich variety of ways how the path integral formalism can be used in different 
applications. 


6.3 Rules of Cost Function Design 


The cost functions allowed in our formulations can have arbitrary state cost, but need quadratic 
command cost. This is somewhat restrictive, although the user can be flexible in what is defined as 
acommand. For instance, the dynamic movement primitives (30) used in this paper can be written 
in two alternative ways: 


7 = fe+gr(0+e:), 
or 


be = wa ([9]) 


where the new noise vector €; has one additional coefficient. The second equation treats f; as another 
basis function whose parameter is constant and is thus simply not updated. Thus, we added f; to the 
command cost instead of treating it as a state cost. 

We also numerically experimented with violations of the clean distinction between state and 
command cost. Equation (36) could be replaced by a cost term, which is an arbitrary function of 
state and command. In the end, this cost term is just used to differentiate the different roll-outs 
in a reward weighted average, similarly as in Peters and Schaal (2008c) and Koeber and Peters 
(2008). We noticed in several instances that PI? continued to work just fine with this improper cost 
formulation. 
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Again, it appears that the path integral formalism and the PI? algorithm allow the user to exploit 
creativity in designing cost functions, without absolute need to adhere perfectly to the theoretical 
framework. 


6.4 Dealing with Hidden State 


Finally, it is interesting to consider in how far PI? would be affected by hidden state. Hidden state 
can either be of stochastic or deterministic nature, and we consider hidden state as adding additional 
equations to the system dynamics (3). Section 2.3 already derived that deterministic hidden states 
drop out of the PI* update equations—these states of the system dynamics were termed as “ non- 
directly actuated” states. 


More interesting are hidden state variables that have stochastic differential equations, that is, 
these equations are uncontrolled but do have a noise term and a non-zero corresponding coefficient 
in G; in Equation (3), and these equations are coupled to the other equations through their passive 
dynamics. The noise term of these equations would, in theory, contribute terms in Equation (36), 
but given that neither the noise nor the state of these equations are observable, we will not have the 
knowledge to add these terms. However, as long as the magnitude of these terms is small relative to 
the other terms in Equation (36), PI’ will continue to work fine, just a bit sub-optimally. This issue 
would affect other reinforcement learning methods for parameterized policies in the same way, and 
is not specific to PI’. 


6.5 Arbitrary States in the Cost Function 


As a last point, we would like to consider which variables can actually enter the cost functions for 
PI’. The path integral approach prescribes that the cost function needs to be a function of the state 
and command variables of the system equations (3). It should be emphasized that the state cost q; 
can be any deterministic function of the state, that is, anything that is predictable from knowing the 
state, even if we do not know the predictive function. There is a lot of flexibility in this formulation, 
but it is also more restrictive than other approaches, for example, like policy gradients or the POWER 
algorithm, where arbitrary variables can be used in the cost, no matter whether they are states or 
not. 


We can think of any variable that we would like to use in the cost as having a corresponding 
differential equation in the system dynamics (3), that is, we simply add these variables as state 
variables, just that we do not know the analytical form of these equations. As in the previous 
section, it is useful to distinguish whether these states have deterministic or stochastic differential 
equations. 

If the differential equation is deterministic, we can cover the case with the derivations from 
Section 2.3, that is, we consider such an equation as uncontrolled deterministic differential equation 
in the system dynamics, and we already know that we can use its state in the cost without any 
problems as it does not contribute to the probability of a roll-out. 

If the differential equation is stochastic, the same argument as in the previous section applies, 
that is, the (unknown) contribution of the noise term of this equation to the exponentiated cost (36) 
needs to be small enough for PI? to work effectively. Future work and empirical evaluations will 
have to demonstrate when these issues really matter—so far, we have not encountered problems in 
this regard. 
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7. Conclusions 


The path integral formalism for stochastic optimal control has a very interesting potential to dis- 
cover new learning algorithms for reinforcement learning. The PI’ algorithm derived in this paper 
for learning with parameterized policies demonstrated a surprisingly good performance, literally 
without any need for manual tuning of the parameters of the algorithm. We also demonstrated that 
the algorithm scales well into very high dimensional domains that were previously hardly approach- 
able for reinforcement learning. Future work will thus allow us to focus much more on machine 
learning algorithms for cost function design, as the algorithmic components of the learning algo- 
rithm seem to be able to move towards a “black box” character. 
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Appendix A. 


Appendix A contains the lemmas Al and A2 and one theorem. The theorem provides the main 
result of the generalized path integral control formalism expressed by (18), (19), (20). Its proof is 
based on results proven in the lemmas A1 and A2. In appendix B we provide the Feynman-Kac 
formula and we sketch the corresponding proof. 


Lemma 1 : The optimal control solution to the stochastic optimal control problem expressed by 
(1),(2),(3) and (4) is formulated as: 


w= jim [RGT S AV oS 
dt0 4 Xi; 


_15t, 5 
where p(t) = aN a - is a path dependent probability distribution. The term S(t;) is 


Sexp(—45(Ti) aT 
a path function defined as S(t) = S(t) + Ay og |H,,| that satisfies the following condition 


limao f exp (15 (ti))dī; ecu for any sampled trajectory starting from state X;,. Moreover the 
term H, is given by H,, = GOR GOT while the term S(t;) is defined according to 


= (c) (c) 
y I Xij Xy WE dt 
ld dt tj Hi, 5 
j=i 





N-1 1 
S(T) = Oty + > qdt + 5 
jai 


Proof The optimal controls at the state x, is expressed by the equation u, = —R'G,Vx, V. Due 
to the exponential transformation of the value function Y, = —Alog V, the equation of the optimal 
controls is written as: 


Xi Y, i 


ARTIG X 
u; = ti Y, 
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In discrete time the optimal control is expressed as follows: 


V5, Be? 
u; = lim (iret e 


dt30 (dt) 
ti 


By using equation (17) and substituting Y4” (x,t) we have: 


“iar Vx, J EXP (—4Z(t;)) dt; 
(x 'G;, fexp (—4.Z(t;)) dt; i 





u; = lim 
dt—0 


Substitution of the term Z(t;) results in the equation: 





Vx, fexp ( 1 §(q;) — MWD Jog (2ndth) ) dr, 
u, = lim | AR~'G/ — a 
di=0 Jexp( 1 §(q;) — MD og (2ndth) dt; 








Next we are using standard properties of the exponential function that lead to: 


l Mase Vx, [s exp (—+S(t;)) exp — log (2ndrd) Jd 
u= n AR Ge 7 LE ANI 
exp (—75(t;)) exp (-2 log (2ndtd) ) a 





The term exp (- ANI log (2ndt’)) does not depend on the trajectory 1;, therefore it can be taken 


outside the integral as well as outside the gradient. Thus we will have that: 


i E exp (- UE log (2ndr’) Vx, [S exp (— 75(ti))dti| 
u; = lim — 
t dt—0 t exp _AN-iì)l log 2mndtÀ exp —1ş Ti dt; 

2 x 





The constant term drops from the nominator and denominator and thus we can write: 


Vx, J exp ee 
fexp (—48(t;)) dt; ` 


Under the assumption that term exp (- is (ti) ) dt; is continuously differentiable in x,, and dt we 
can change order of the integral with the differentiation operations. In general for Vy f f(x, y)dy = 
J Vx (x,»)dy to be true, f(x,t) should be continuous in y and differentiable in x. Under this as- 
sumption, the optimal controls can be further formulated as: 





dt—0 


u; = lim (i 








i rS Yx, exp (—45(ti)) dt; 
"o fexp (—48(t;)) dt; 
Application of the differentiation rule of the exponent results in: 


_, 2 f exp (—45(ti)) Vx, (—45(t)) dt; 
Gi, fexp (—48(t;)) dt; 








u; = lim 
dt>0 
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The denominator is a function of x, the current state and thus it can be pushed inside the integral 
of the nominator: 


u; = lim 
dt—0 








-IQT exp (~;5(i)) le. l 
AR G; To aa ( 50) il 


_15 (7; 
By defining the probability P (t;) = ae the expression above can be written as: 
A L I 


— y; -1 QaT Sif lg i ; 
u, = lim aR G7 f P (ti) Vx, (-550)) an. 


Further simplification will result in: 


u, = lim [-Rte? f pt) V, Saan: ; 
dt—0 $ ? 


We know that the control transition matrix has the form G(x,,)’ = [07 G, (Xx, )"]. In addition 


the partial derivative Vy, S (t;) can be written as Vx, (t;)? = [Vim (ti)! V os(t)"]. By using 
these equations we will have that: ' ' 


u, = lim (R10 617) f p) 
dt—0 


The equation above can be written in the form: 





u, = lim (0 R67] f P(T) 
t—0 





or 


u, = lim (0 RG? 
dt0 





Therefore we will have the result 


u; = lim [RG [p(n © a : 
dt0 Xr; 


Lemma 2 : Given the stochastic dynamics and the cost in (1),(2),(3) and(4) the gradient of the 
path function S(t;) with respect to the directly actuated part of the state xis formulated as: 


lim (v oS) =H! (Gie, —b,,) 
dt0 x : i 
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T 
where the function b(x,,) defined as MH(x;,,)®,, with H, = GORGO and the quantity ®, € R!*! 
is expressed as: 


trace | H; ' d (co) Hi; 
ti 


a 


a H,! d (et) Hy 


trace (H, = aa, ) | 
Proof 


We are calculating the term Vos (To) . More precisely we have shown that 


(c) (c) iN- 1 


z N-1 1 N=! ae =x; 
S(t) =Oy + È} qdt+5 $ || as ‘lh, dt+ > ‘y log |H;,|. 
jai 2 i= dt 2 5 


1 


To limit the length of our derivation we introduce the notation Y, = Off, h, Os, and Oy, = 


c 
—Xx;. 


(c) 
ae — 27 — tiar) and it is easy to show that || ae — pa li, dt = h, and therefore we will 


have: 
N-1 


S(ti) = Ory + m LW LO +5 Y bot 


In the analysis that follows we provide the derivative of the 1th, 2th and 4th term of the cost 
function. We assume that the cost of the state during the time horizon Q, = 0. In cases that this 
is not true then the derivative V © Er Q,dt needs to be found as well. By calculating the term 

ti 


V xO \Õ(To) we can find the local controls u(t;). It is important to mention that the derivative of the 


path cost S(t;) is taken only with respect to the current state x,,. 
The first term is: 


V0 (Oty) = 0 


DERIVATIVE OF THE 2TH TERM V ¢ o [za L? Yn] OF THE COST S(t;). 


The second term can be found as follows: 
1 Al 
olan L ‘| i 
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The operator V œo is linear and it can massaged inside the sum: 
to 


dt py vo (Ya) s 


Terms that do not depend on x drop and thus we will have: 


1 


Dat * OVa- 


Substitution of the parameter Y; = af H, ' , will result in: 


"A o [a a; H; 1 O,,| - 


By making the substitution B,, = H;' o, and applying the rule V (u(x)’ v(x)) =V (u(x)) v(x) + 
V (v(x)) u(x) we will have that: 


1 


di Poa, B, +VoB,, | (40) 


Next we find the derivative of 0,,: 
V (90, = Vw [x —x —£,(x, dt 
x® ti x ti+1 ti c\AT; A 
and the result is 
V oQ; = jx, — V wf dt. 
Zr : Bas 
We substitute back to (40) and we will have: 


1 
2dt 


1 
-zm (maY i Par) Bit ag Yao Or 


After some algebra the result of V o (za a ye a Yr) is expressed as: 


|- (tat Yy fe ar) Br + Vc o By Ar|- 


1 ly 
~ 2dt hisy ? 8B, + s 


The next step now is to find the limit of the expression above as dt — 0. More precisely we will 
have that: 


Vx B; Olr. 


1 lg 


lim |— sa e dBi Ql. 
asol 2dt B, ay pyk Vo, ti 


m 
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LIMIT OF THE FIRST SUBTERM: — z}; B,, 


We will continue our analysis by finding the limit for each one of the 3 terms above. The limit of 
the first term is calculated as follows: 


1 1 
pier (i 6, = -iala H, o ) 
=-._ H7! lima, 
dt-0 


; ti 


L i ae ORONO) 
=-5 H; ' Hit (62 -x ram 


LIMIT OF THE SECOND SUBTERM: -iV of fe ) B, 


The limit of the second term is calculated as follows: 


. /1 . 
Eg (apf B) =—2% ph) J, 


The limit of the term limgr—0o Qy; is derived as: 


(o) _,© _ (c) (c) TA = 
Jim (x2, —X;, fe(x,,)dt) = — jim eo — X;, ) — Jim fe(x;,)dt =0-0=0. 


ral 
LIMIT OF THE THIRD SUBTERM: 57 Vo By, Or 


Finally the limit of the third term can be found as: 


1 
lim, (r Vao Pa on ) = 
1 
= lim V ; li o | = 
ao a oB ilaa os ) 


Ç Cc. 1 C 
= lim V obe 5 l lim (2 x) fh ne 


We substitute B, = H; ' , and write the matrix H, l in row form: 
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= lim V_@ (H;' Ou) 


dt—0 Xt; 





| a0” 
go Oy, 
-T 
H” Oy, 
= be () _ @ 1 _ ge) 
= aa fim, (oe, get 


We again use the rule V (u(x)? v(x)) = V (u(x)) v(x) + V(v(x)) u(x) and thus we will have: 


T 
ST -T 
(v Hí 1) Ou, + Vo) Ou, Hi” ) 


i 


T 
-T -T 
| (von? Or, + V (0) Or H”) ) 





= lim 
dt0 | 





T 
l =F l —T 


We can split the matrix above into two terms and then we pull out the terms œ, and Vi Ar 
Ti 
respectively : 
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1 T 
V oH ; Hy)” 
2 a 
Ms oH,” HO : 
oF T l ! T . QEON G) 
= Jim, | | + | | Poo! zim, (2 Er 6) 


. £ i ee 
V oH” i | HÜ 


1 : 1 
4 T 1 1 T : (c) _ glo) (c) 
= jim (a VoH; +H, voar) 5 lim («x! ao) T =f ) 





1 A l 
[xy T -1 -1 y; T . (c) _ (0) (c) 
= ( lim (at) V oH, +H, lim (Vor) ) 5 lim, (ei x Mai =f ) 


Since limg;_509 (a7 ) = 0x; and limg;_59 (v Oy, ) = —I},,; the final result is expressed as fol- 
ti 


lows 


e E -H Lim (&® -xl _ 
lim Gaza o ) =-H, Sa E X; l fa J- 


After we have calculated the 3 sub-terms, the 2th term of the of the derivative of path cost S (to) 


can be expressed in the following form: 
case -1 tim (ig) x1 _ 0 
Vo (at 2 w) = -H, Jim (a, — X; a — i; )) : 


DERIVATIVE OF THE FOURTH TERM V {9 =; log IH.) OF THE COST S(t;). 
The analysis for the 4th term is given below: 
A A 
Vo (3 L log |H;,| } = 7x0 log |H, |. 
j=i 


If we assume that x) = xen x) og) and take the derivative with respect to each element we 


will have 
À à 1 

0 «i | —log|H,| | = ~-—— d alH]. 

wo (5 og i) 2 B, „(e [Hi 


i w i 
9o ($102) = 3 AG,) 


A 
ded (žr |H, ) = zirace (1, l aE, ) ; 
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Where we make used of the identity ddet(A) = det(A)Tr (A~'0dA). The result is expressed as: 


trace H,! 0. (e, H | 


trace H,! de) H, 


A 
V0 G log |H, | F 


Pr trace 9e H, ) | 


or in a more compact form: 


A — 
Vi (F toe = H; 'b;. 
where b(x,) = AH(x;,)®;, and the quantity ®, € RK’! is defined as: 


trace | H; ; d ct) iy; 
ti 

| trace H,' d (co) Hh, | 
ti 


1 


| trace (1, l att) | 


Since we computed all the terms of the derivative of the path cost (to) and after putting all the 
terms together we have the result expressed as follows: 


2 ORON _ ge) 
Jim, (V30) =H" (yim, (a, g) —bs 


By taking into account the fact that limy;_,0 (e. — x) 4- t) = Ge, we get the follow- 


ing final expression: 





. & -1/QM,. _ 
lim (v S(t ))= _H, (Gi £, b; ) - 


i 


Theorem 3 : The optimal control solution to the stochastic optimal control problem expressed 
by (1),(2),(3),(4) is formulated by the equation that follows: 


u,, = lim / p(t) uz (ti) dt, 
dt>0 
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Sas is a path depended probability distribution and the term u (ti) de- 


7 Soke 
fined as u; (ti) = R-IGOT OF (Gi RGT) (Gis, = bi.) are the local controls of each sam- 
pled trajectory starting from state x, The terms ¢, and b, are defined as €, = 

T 
(i) - x) 1 n) and b(x,) =AH(x,)®, with H, =GOR'G and ®, given in (41). 


where p(t) = 


Proof 
To prove the theorem we make use of the Lemma 2 and we substitute Va oŠ(t (ti) in the main 


result of Lemma 1. More precisely from lemma A1 we have that: 


u, = lim (repr f P (ti) H, | (vos) ani). 
dt—0 ti 


The terms R`! and G, can be pushed insight the integral since they are independent of t; = 
(X1,X2,..-,Xv). Thus we have the expression: 


u, = lim ( [paR epn (vy S(t “)) a). 
dt-0 ae 


z (dt) 
u; = lim P(t) u uz (Ti) dti, 


(dt 


where the local controls uz, (zi) are given as follows: 
us”? (ti) =R- 1 Go? TH; 1 Vio) S(t) 5 


After applying the limit, and making use of the result in Lemma 2 the equation above is ex- 
pressed as: 


u, = pa i) Uz (Xi. Xy dT, 


where the local controls uz (X;,,,,X;,) are given as follows: 





E C nr . C C. 1 (3 
w(t) = w(x) = RIGET (tim, (x) w) g) bs). 


or in a simpler form: 
U(X, Xr) = R'G07TH,! (G.€;; = b;;) : 


By substituting with H(x;,) = GOR 1GeT we have the final result: 


yr\—! fale 
UL(Ti) = Uz (X,,,,X1,) = RG 07 (cP RGP") (Gi de, —b,,) i 
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Appendix B. 


Theorem 4 : Let us assume that x satisfies the SDE x = f(x,t) + G(x)e(t). Then ¥(x,to) = 
Wisin) =E (Ptn) eho AAE if and only if ¥(x,t) satisfies the backward Kolmogorov 
PDE: 


1 1 
9Y, = -74 +£7 (VY) + ztrace ((Vxx¥;)G;ZeG7), 


with boundary condition: 
1 
Yy) = exp (=7000) ); 


Proof Given that x satisfies the SDE x = f(x,t) + G(x)e(t) and ¥(x,t) satisfies the PDE above, 
application of Ito lemma (Øksendal, 2003) to function Y (t) = ¥(x,,t) elo 29)4T Jeads to the final 
result ¥(x, to) = E (Ptn) eho “ade, This result is the solution of the linear PDE. 

a 
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